ONNX Runtime at the Edge: Portable, Fast Inference

Circuit board with an illuminated central chip, representing ML inference on hardware

Deploying a machine-learning model outside the notebook where it was trained is usually where the fantasy breaks. You train in PyTorch on a cloud GPU, and suddenly you need to serve inference on a Linux server, inside an iOS app, on an ARM industrial gateway and —if product gets creative— in a tab of the customer’s browser. Each destination traditionally brings its own runtime: TensorFlow Serving here, Core ML there, TensorFlow Lite for Android, TensorFlow.js for the browser. With luck, the same model; with a lot of luck, the same numerical precision.

ONNX Runtime —the Microsoft-driven multiplatform inference engine— is what most teams arrive at when that pain becomes chronic. It turns ONNX (Open Neural Network Exchange) from a spec into a usable tool: export once from PyTorch or TensorFlow and run almost anywhere with the same artifact. In version 1.17, current as of March 2024, it is mature enough to be a reasonable default for most edge deployments. But it isn’t free, and it pays to understand what you gain and what you give up.

What ONNX actually solves

The problem is not purely technical; it's organisational. An ML team training in PyTorch and a mobile team integrating on iOS speak different languages. Without a bridge format, each new target adds weeks of re-engineering: convert the graph, validate that outputs match within tolerance, rediscover which operators aren't supported, rewrite the preprocessing pipeline.

ONNX cuts that knot by proposing an open intermediate format —a computational graph with standard operators versioned by opset. ONNX Runtime is the reference implementation that executes it: a single .onnx artifact that works for server, mobile, browser and edge without duplicating tooling.

If the team lives exclusively inside one ecosystem, the native alternative tends to perform better. TensorRT extracts more from an NVIDIA GPU; Core ML squeezes an A17 Pro better; TFLite with XNNPACK is still very competitive on Android. But as soon as a second target appears, the cost of maintaining two or three native runtimes exceeds whatever you lost by picking the common denominator.

Export, the step that looks easy

All of ONNX Runtime’s value depends on the export from the source framework working correctly. In PyTorch that means torch.onnx.export with a sample input, tensor names and —critically— explicit declaration of which axes are dynamic:

import torch

# model: the trained torch.nn.Module you want to export
dummy_input = torch.randn(1, 3, 224, 224)  # sample input used to trace the graph

torch.onnx.export(
    model.eval(),            # inference mode: freezes dropout and batch norm
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={           # mark the batch axis as variable
        "input": {0: "batch_size"},
        "output": {0: "batch_size"},
    },
    opset_version=17,        # operator set every target runtime must support
)

Without dynamic_axes, the model is frozen to batch size 1 and breaks in production the moment a different batch arrives. opset_version determines which operators are available; 17 is a reasonable compromise in early 2024 —modern enough for recent models, old enough that every EP understands it.

The non-negotiable step after export is validation: feed the same tensor to the PyTorch model and to the ONNX Runtime session and compare with np.allclose at a conservative tolerance. Typical sources of discrepancy: poorly-translated custom operators, float32 vs float16 precision differences, and the perennial batch-norm issue between training and inference mode.

Execution Providers: where the performance lives

ONNX Runtime’s architecture separates the graph from the hardware through Execution Providers (EPs). The default EP is CPU —optimised via MLAS—, but the real argument of the runtime is the list that stacks on top: CUDA and TensorRT for NVIDIA, OpenVINO for Intel, DirectML for Windows with any GPU, ROCm for AMD on Linux, Core ML for Apple Silicon, NNAPI for Android, QNN for Snapdragon, WebGPU and WebAssembly for the browser.

Configuration is a prioritised list: the runtime tries the first, falls back to the next if the hardware isn’t there, and lands on CPU as the safety net. The same Python code runs inference on the datacenter GPU in development and on the edge CPU in production without touching a line.

The important nuance in March 2024 is the real state of NPU EPs. QNN for Snapdragon 8 Gen 3 exists and works but needs additional conversion and operator coverage is still incomplete. The NPUs of the Intel Core Ultra (Meteor Lake) are reached via OpenVINO and demos look good, but documentation trails the silicon. Apple’s Neural Engine is only accessible through Core ML, which ONNX Runtime invokes indirectly when it delegates the subgraph. They are there, but treating them as transparent accelerators is still optimistic.

Quantization and graph optimisation

At load time, ONNX Runtime applies automatic passes —operator fusion, constant folding, dead-node elimination— without intervention. The big jump in size and latency comes from quantization, which reduces weights and activations from float32 to INT8 (or INT4 for transformers) with quality loss usually below 1%.

Dynamic quantization is the cheap entry point: one line, no calibration, acceptable for most CNNs. To get the full 4× size reduction and the 2-4× latency gain, the proper move is static quantization with a representative calibration dataset. On large models like Whisper or BERT, INT8 is often the difference between running on a decent phone and not running at all.

Browser, mobile and the real edge case

onnxruntime-web is the underrated piece of the catalogue. It runs ONNX models in the browser using WebGPU when available and WebAssembly as fallback. For image classification, detection, or a Whisper-tiny transcribing on the client, it removes the need for an inference service and the GPU bill that comes with it.

On mobile, ONNX Runtime Mobile produces .aar for Android and Swift/Objective-C frameworks for iOS, with React Native bindings and reasonably-maintained Flutter plugins. For 20-100 MB models, it is almost always the lowest-friction option. On embedded edge —Jetson, Raspberry Pi, ARM industrial gateways— the argument is portability: iterate on workstation with CUDA, validate on laptop CPU, deploy on Jetson Orin with the TensorRT EP without rewriting a thing.

My take

ONNX Runtime isn’t the fastest engine on any specific platform, and almost no single benchmark will crown it champion. It doesn’t pretend to. Its proposition is to absorb the complexity of heterogeneity: one artifact, one API, many targets, and enough performance margin on each to make portability pay.

The calculation is clear when the team has to serve on more than one platform or doesn’t want to lock itself into a cloud provider’s proprietary runtime. If the deployment is exclusively large NVIDIA GPUs in a datacenter, TensorRT or vLLM will probably win. If it’s exclusively iOS, native Core ML rides higher. As soon as a second target appears —and with PC NPUs, browser inference and mobile apps pushing that way, it happens sooner every year— the trade-off tilts by itself.

The honest caveat as of March 2024 is the state of NPUs: the commercial narrative runs ahead of the operational reality, and anyone betting that ONNX Runtime will magically abstract Hexagon or Neural Engine will crash into unsupported operators and partial coverage. The promise is being delivered, but at silicon pace, not tweet pace. For everything else —CPU, consumer GPU, browser, classic mobile— it’s already the sensible choice.
